Bayesian analysis of Ecological Momentary Assessment (EMA) data collected in adults before and after hearing rehabilitation

This paper presents a new Bayesian method for analyzing Ecological Momentary Assessment (EMA) data and applies it in a re-analysis of data from a previous EMA study. The analysis method has been implemented as a freely available Python package, EmaCalc (RRID:SCR_022943). The analysis model can use EMA input data including nominal categories in one or more situation dimensions, and ordinal ratings of several perceptual attributes. The analysis uses a variant of ordinal regression to estimate the statistical relation between these variables. The Bayesian method imposes no requirements on the number of participants or the number of assessments by each participant. Instead, the method automatically includes measures of the statistical credibility of all analysis results, for the given amount of data. For the previously collected EMA data, the analysis results demonstrate how the new tool can handle heavily skewed, scarce, and clustered data that were collected on ordinal scales, and present results on interval scales. The new method revealed results for the population mean similar to those obtained in the previous analysis by an advanced regression model. The Bayesian approach automatically estimated the inter-individual variability in the population, based on the study sample, and could show some statistically credible intervention results also for an unseen random individual in the population. Such results may be interesting, for example, if the EMA methodology is used by a hearing-aid manufacturer in a study to predict the success of a new signal-processing method among future potential customers.


A.1 Nominal EMA Situation Responses
Based on the $r$th recorded EMA response by the $n$th participant in the $t$th test phase, the situation is identified by a so-called 1-of-$K$ binary vector $z_{nrt}$, with one element $z_{nrt,k} = 1$ indicating that the $k$th situation category was selected, and all other $z_{nrt,j \ne k} = 0$.
If the study includes several situation dimensions, identifying, e.g., the participant's listening task or the background noise level, the situation response from each participant might also be indexed as $z_{nrt,k_1,k_2,\ldots}$, where $k_1$ is the index identifying the category in the first situation dimension, $k_2$ is the index in the second dimension, etc. The multi-dimensional index $(k_1, k_2, \ldots)$ can always be uniquely represented by an equivalent linear index $k$, and vice versa. Therefore, we mainly use the one-dimensional index notation in the mathematical formulation.
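For readers who want this index bookkeeping made concrete, the following minimal Python sketch shows the mapping between a multi-dimensional situation index and its equivalent linear index, using NumPy; the dimension sizes here are hypothetical examples, not taken from the study:

```python
import numpy as np

# Hypothetical example: two situation dimensions, with 3 listening-task
# categories and 4 background-noise categories.
dim_sizes = (3, 4)

# Multi-dimensional index (k1, k2) -> equivalent linear index k:
k1, k2 = 2, 1
k = np.ravel_multi_index((k1, k2), dim_sizes)      # k = 9

# ...and back again, linear index k -> (k1, k2):
assert np.unravel_index(k, dim_sizes) == (k1, k2)
```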
The model assumes that the occurrences of the various situations in the EMA records are determined by fixed but unknown situation probability vectors $u_{nt} = (u_{nt1}, \ldots, u_{ntK})$. Here, $u_{ntk} \in [0, 1]$ is the probability that the $n$th participant reports from the $k$th situation category at any assessment in the $t$th test phase. The array of situation probabilities is one of the main results to be estimated from the data.
The individual situation probabilities are assumed to be the same for all EMA records from the same test phase, but may vary between participants and between test phases. The recorded EMA situation responses are assumed conditionally independent, given the parameters. Thus, the responses follow a categorical distribution, conditional on the probabilities,
$$p(z_{nrt} \mid u_{nt}) = \prod_{k=1}^{K} u_{ntk}^{z_{nrt,k}}. \tag{A.1}$$
To ensure that $u_{nt}$ is restricted to be a properly normalized probability-mass vector during the computation, each vector is mapped as a function of a parameter vector $\alpha_{nt} = (\alpha_{nt1}, \ldots, \alpha_{ntK})$ with unrestricted real-valued elements, by the softmax function
$$u_{ntk}(\xi_n) = \frac{e^{\alpha_{ntk}}}{\sum_{j=1}^{K} e^{\alpha_{ntj}}}, \quad \forall n, t, k. \tag{A.2}$$
All parameters $\alpha_{ntk}$ for the $n$th participant are stored as the first $TK$ elements of the individual parameter vector $\xi_n$. Denoting the total array of all situation responses as $z_n = (\ldots, z_{nrt}, \ldots)$, the log-likelihood of all observed nominal situation data for the $n$th participant is
$$\ln p(z_n \mid \xi_n) = \sum_{r,t} \sum_{k=1}^{K} z_{nrt,k} \ln u_{ntk}(\xi_n). \tag{A.3}$$
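As an illustration, the softmax mapping (A.2) and the categorical log-likelihood (A.3) can be sketched in a few lines of Python. This is a minimal stand-alone sketch with made-up numbers, not the EmaCalc implementation; the stabilizing shift before exponentiation is a standard implementation detail not stated above:

```python
import numpy as np

def softmax(alpha):
    """Map unrestricted parameters alpha (shape (..., K)) to probabilities, as in (A.2)."""
    a = alpha - alpha.max(axis=-1, keepdims=True)   # shift for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

# One participant, one test phase, K = 3 situation categories (made-up numbers):
alpha_nt = np.array([0.2, -1.0, 1.3])
u_nt = softmax(alpha_nt)                    # situation probabilities u_{ntk}

# 1-of-K situation responses z_{nrt,k}, summed over the records of this phase:
counts = np.array([5, 1, 11])
log_lik = counts @ np.log(u_nt)             # this phase's contribution to (A.3)
```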

A.2 Ordinal EMA Attribute Ratings
Based on the ordinal rating of the $i$th attribute question (with $L_i$ ordinal response alternatives), the response is identified by a 1-of-$L_i$ binary vector $y_{nri}$, with one element $y_{nri,l} = 1$ indicating that the $n$th participant gave the $l$th ordinal response to the $i$th question in the $r$th assessment, and all other $y_{nri,j \ne l} = 0$. The $r$th record also specifies the current situation by the multi-dimensional situation index vector $\mathbf{k}(r) = (k_0, k_1, \ldots)(r)$. The first index identifies the test phase $k_0 = t$, which is defined by the researcher; the other indices $(k_1, k_2, \ldots)$ specify the situation as identified by the respondent, as defined in the previous section. The multi-dimensional array index $\mathbf{k}(r)$ can always be uniquely represented by an equivalent scalar linear index $k(r)$, so this notation is used where it is unambiguous. As discussed in section 3.2.3 of the main paper, the ordinal rating responses are analyzed with a variant of Item Response Theory (IRT). In the model, illustrated in Figure 1, each ordinal response is determined by an outcome of a continuous real-valued latent random variable $Y_{nik(r)}$. The $l$th ordinal response is given whenever the latent variable falls in an interval $\tau_{ni,l-1} < Y_{nik(r)} \le \tau_{ni,l}$, where the thresholds separating the intervals form an increasing sequence $(-\infty = \tau_{ni,0} < \tau_{ni,1} < \cdots < \tau_{ni,L_i} = +\infty)$. The latent variable is drawn from a logistic probability distribution with location $\theta_{nik}$ and unit scale.
The location parameters $(\ldots, \theta_{nik}, \ldots)$ for the $n$th participant include, in principle, a large number of free parameters, one for each combination of nominal categories from all situation dimensions. However, just as in linear regression, the model may be simplified to consider only a linear combination of a smaller set of situation effects, e.g., by expressing the attribute location value as $\theta_{nik_0k_1k_2} = \beta_{nik_0} + \beta_{nik_1} + \beta_{nik_2}$, or $\theta_{nik_0k_1k_2} = \beta_{nik_0k_1} + \beta_{nik_2}$. Here, the value $\beta_{nik_f}$ represents the main effect of the $k_f$th category in the $f$th situation dimension, while $\beta_{nik_0k_1}$ also represents interaction effects between any combination $(k_0, k_1)$ of categories in the zeroth and first situation dimensions. Regardless of these indexing details, the location parameter can always be specified as a fixed linear function $\theta_{nik}(\beta_{ni})$ of an effect vector $\beta_{ni} = (\ldots, \beta_{nij}, \ldots)$ including all the user-selected effects to be estimated by the regression model.
All threshold values are determined by a mapping function $\tau_{ni,l}(\eta_n)$, defined in section A.3, which ensures that the response thresholds form a strictly increasing sequence for each item. If desired, the mapping can also be specified to yield identical thresholds across different items. Thus, the array $\eta_n$ includes parameters defining all response thresholds for the $n$th participant.
From now on, we denote the individual ordinal-model parameters $\beta_n, \eta_n$ as parts of the total individual parameter vector $\xi_n$, with an indexing scheme determined by the structure of the model and the desired analysis results.
Finally, using this simplified parameter notation, the conditional probability of any rating response, given the model parameters, is a known function of the parameters,
$$P(y_{nri,l} = 1 \mid \xi_n) = F\!\left(\tau_{ni,l}(\eta_n) - \theta_{nik(r)}(\beta_{ni})\right) - F\!\left(\tau_{ni,l-1}(\eta_n) - \theta_{nik(r)}(\beta_{ni})\right), \tag{A.4}$$
where $F(\cdot)$ is the cumulative distribution function for a standard logistic random variable,
$$F(x) = \frac{1}{1 + e^{-x}}. \tag{A.5}$$
All ratings, gathered in an array $y_n = (\ldots, y_{nri}, \ldots)$, are assumed conditionally independent across assessments and questions, given the individual parameters $\xi_n$. Thus, the log-likelihood of the observed ordinal response data from the $n$th participant is
$$\ln p(y_n \mid \xi_n) = \sum_{r,i} \sum_{l=1}^{L_i} y_{nri,l} \ln P(y_{nri,l} = 1 \mid \xi_n). \tag{A.6}$$
Since this regression model allows both the perceptual attribute value $\theta_{nik}$ and the threshold parameters $\tau_{ni,l}$ to vary freely for each respondent, the model is under-determined: if a fixed constant is added to all $\theta_{nik}$ and all $\tau_{ni,l}$, the probability (A.4) of the observed responses does not change. A weakly informative prior distribution, defined in Section B below, slightly favors solutions with parameters near zero, so the learning always converges. However, the indeterminacy allows some artificial variability in the parameter values. To avoid this variance, the parameter values must be restricted in some way. The current implementation allows the researcher either (1, the default) to force the median response threshold to zero, or (2) to force the average attribute value to zero, for each respondent and each attribute.
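To make the response model concrete, the sketch below evaluates the rating probabilities (A.4) using SciPy's expit for the logistic CDF (A.5). The threshold values are hypothetical; a real analysis would obtain them from the mapping in Sec. A.3:

```python
import numpy as np
from scipy.special import expit   # the standard logistic CDF, F in (A.5)

def rating_probs(theta, tau_inner):
    """Probability of each of the L ordinal responses, as in (A.4).

    theta:     latent-variable location for the current situation
    tau_inner: the L-1 finite thresholds; -inf and +inf are appended here
    """
    edges = np.concatenate(([-np.inf], tau_inner, [np.inf]))
    return expit(edges[1:] - theta) - expit(edges[:-1] - theta)

# Hypothetical example with L = 4 response alternatives:
p = rating_probs(theta=0.8, tau_inner=np.array([-1.5, 0.0, 1.5]))
assert np.isclose(p.sum(), 1.0)   # the L probabilities sum to one
```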

A.3 Ordinal Response Thresholds
To ensure that the response thresholds form a strictly increasing sequence, it is convenient to define each response-interval width $\tau_{ni,l} - \tau_{ni,l-1}$ by its corresponding width $F(\tau_{ni,l}) - F(\tau_{ni,l-1})$ in the range $[0, 1]$, using a mapping function
$$F(\tau_{ni,l}) - F(\tau_{ni,l-1}) = \frac{w(\eta_{nil})}{\sum_{j=1}^{L_i} w(\eta_{nij})}, \tag{A.7}$$
where $F(\cdot)$ is the logistic distribution function defined in (A.5), and the $w(\eta_{nij})$ are relative interval widths. The widths are mapped from a parameter vector $\eta_{ni} = (\eta_{ni1}, \ldots, \eta_{nil}, \ldots, \eta_{niL_i})$ with elements in the full range $(-\infty, +\infty)$. The inverse mapping defines the thresholds as
$$\tau_{ni,l}(\eta_n) = F^{-1}\!\left(\frac{\sum_{j=1}^{l} w(\eta_{nij})}{\sum_{j=1}^{L_i} w(\eta_{nij})}\right). \tag{A.8}$$
Scaling all widths $w(\eta_{nij})$ by any positive factor does not change the resulting thresholds, so the $L_i$ parameter values actually define only $L_i - 1$ free thresholds, as desired.
The width-mapping function must be monotonically increasing and differentiable, but is otherwise not critical. It is chosen for numerical convenience as
$$w(\eta) = \epsilon + \ln\left(1 + e^{\eta}\right), \tag{A.9}$$
where $\epsilon$ is a tiny positive constant preventing numerical underflow in the computation.
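The following sketch implements this threshold mapping, assuming the softplus width function (A.9); the inverse logistic CDF is SciPy's logit. It is meant only to show that any real vector $\eta$ yields a strictly increasing threshold sequence:

```python
import numpy as np
from scipy.special import logit    # inverse of the logistic CDF F

def thresholds(eta, eps=1e-10):
    """Map an unrestricted vector eta (length L) to the L-1 finite,
    strictly increasing response thresholds, as in (A.7)-(A.9)."""
    w = eps + np.logaddexp(0.0, eta)       # widths w(eta) = eps + ln(1 + e^eta)
    cum = np.cumsum(w) / w.sum()           # cumulative relative widths in (0, 1]
    return logit(cum[:-1])                 # inner thresholds tau_1, ..., tau_{L-1}

tau = thresholds(np.zeros(4))              # equal widths -> symmetric thresholds
assert np.all(np.diff(tau) > 0)            # strictly increasing, as required
```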

B Population model
All nominal and ordinal response patterns from the $n$th participant were assumed to be determined by a parameter array $\xi_n = (\xi_{n1}, \ldots, \xi_{nd}, \ldots, \xi_{nD})$, including the three separate classes of parameters $(\alpha_n, \beta_n, \eta_n)$ defined in section A above. As all participants were recruited at random from the same population, as defined by the researcher, the model treats each individual parameter vector $\xi_n$ as a sample drawn at random from a population distribution.

B.1 Gaussian mixture model
The individual parameter vectors are assumed drawn from a population distribution in the form of a Gaussian mixture model (GMM),
$$p(\xi_n \mid \mu, \lambda, \zeta_n) = \prod_{c=1}^{C} \left[\prod_{d=1}^{D} \mathcal{N}\!\left(\xi_{nd} \mid \mu_{cd}, \lambda_{cd}^{-1}\right)\right]^{\zeta_{nc}}, \tag{B.1}$$
where $\mu_c = (\mu_{c,1}, \ldots, \mu_{c,D})$ is the mean vector, and $\lambda_c = (\lambda_{c,1}, \ldots, \lambda_{c,D})$ is the precision vector (inverse variances) of the $c$th mixture component, with $\mu = (\mu_1, \ldots, \mu_C)$ and $\lambda = (\lambda_1, \ldots, \lambda_C)$ denoting the total set of parameters for all mixture components (in each population model). The selected mixture component is indicated by a latent 1-of-$C$ binary array $\zeta_n = (\zeta_{n1}, \ldots, \zeta_{nC})$, where $\zeta_{nc} = 1$ indicates that the $n$th respondent has a parameter vector drawn from the $c$th mixture component, and all other elements are $\zeta_{n,j \ne c} = 0$.
It is no serious restriction to assume independent elements (diagonal covariance) in each mixture component, because the complete mixture model can still capture any statistical dependence between elements of ξ n . If several subject groups are included, representing different populations, a separate GMM is defined for each population, although this is not reflected in the math notation here. The number of mixture components may differ among population models.
Of course, it is never known exactly from which mixture component the individual parameter vector $\xi_n$ was actually generated. Therefore, the resulting mixture density can be equivalently written as a weighted sum across all mixture components,
$$p(\xi_n \mid \mu, \lambda) = \sum_{c=1}^{C} \langle \zeta_{nc} \rangle \prod_{d=1}^{D} \mathcal{N}\!\left(\xi_{nd} \mid \mu_{cd}, \lambda_{cd}^{-1}\right), \tag{B.2}$$
with weights equal to the means $\langle \zeta_{nc} \rangle = \mathrm{E}[\zeta_{nc}] \in (0, 1)$.
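A diagonal-covariance mixture density like (B.2) is easy to evaluate in the log domain, as in this minimal sketch (made-up weights, means, and precisions; not the EmaCalc code):

```python
import numpy as np
from scipy.special import logsumexp

def gmm_logpdf(xi, w, mu, lam):
    """ln p(xi) for a diagonal-covariance GMM, as in (B.2).

    w:   (C,)   mixture weights, e.g., the means <zeta_nc>
    mu:  (C, D) component mean vectors
    lam: (C, D) component precision vectors (inverse variances)
    """
    # Per-component Gaussian log-density, summed over the D dimensions:
    log_comp = 0.5 * np.sum(np.log(lam) - lam * (xi - mu) ** 2
                            - np.log(2.0 * np.pi), axis=-1)
    return logsumexp(log_comp, b=w)   # log of the weighted sum over components

rng = np.random.default_rng(0)       # made-up C = 2 components in D = 3 dimensions
print(gmm_logpdf(rng.normal(size=3), np.array([0.7, 0.3]),
                 rng.normal(size=(2, 3)), np.ones((2, 3))))
```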
In the Bayesian framework, all these parameters are again modeled as random variables with weakly informative prior distributions reflecting the prior knowledge and assumptions about the model.

B.2 GMM parameter priors
A single Gauss-gamma prior is defined for all GMM components in all population models:
$$p(\mu_{cd} \mid \lambda_{cd}) = \mathcal{N}\!\left(\mu_{cd} \mid m'_{cd}, (\nu' \lambda_{cd})^{-1}\right), \tag{B.5}$$
$$p(\lambda_{cd}) \propto \lambda_{cd}^{a' - 1}\, e^{-b'_d \lambda_{cd}}. \tag{B.6}$$
In the absence of prior information, we assign all $m'_{cd} = 0$, except for the log-probability parameters $\alpha$, which are set for a uniform distribution of situation probabilities. The Jeffreys prior for the mean and precision of a Gaussian distribution would suggest $\nu' \to 0$, $a' \to 0$, $b'_d \to 0$. However, to avoid computational indeterminacy causing numerical overflow in case of extreme response patterns, e.g., if all respondents use only the highest ordinal rating, we must use a weakly informative prior.
The effective weight of the prior on the population mean, relative to the weight of one real test participant, is assigned as $\nu' = 0.5$. The prior gamma-distribution shape parameter is assigned as $a' = \nu'/2$, and the inverse scale as $b'_d = \sigma_d^2/2$, where $\sigma_d$ is a crude estimate of the magnitude of parameter variations, as specified below. The factor $1/2$ is included to conform with the update equations (C.24) and (C.25).
Setting $b'_d$ clearly greater than zero can have an effect on the automatic Occam's Razor determination of model complexity, because it prevents the Gaussian components from becoming centered on a single data point with infinite precision (zero variance). With any $a' < 1$, the prior probability density for the precision $\lambda_{cd}$ is concentrated near zero. The prior expectation of the variance $1/\lambda_{cd}$ is undefined for $a' \le 1$, but its mode is
$$\operatorname{mode}\!\left[1/\lambda_{cd}\right] = \frac{b'_d}{a' + 1}. \tag{B.7}$$
The prior scale of the parameters $\xi_d$ is assigned as $\sigma_d = 1$ for all elements related to the logarithmic parameters $\alpha_{nt}$ for the nominal situation probabilities in (A.3). The same prior scale is also assigned for all elements related to the logarithmic parameters $\eta_{nl}$ for the rating thresholds in Sec. A.3. For the elements related to the attribute regression effects, the prior scale is assigned as $\sigma_d = \pi/\sqrt{3}$, the standard deviation of the standard logistic latent variable.

[Figure: prior density functions.] The plot shows density functions for any $d$th element related to the attribute regression-effect parameters, with hyperparameters $\nu' = 0.5$, $a' = \nu'/2$, and $\sigma_d = \pi/\sqrt{3}$, as discussed in Sec. B.2. The marginal distributions for the attribute value and the mean are Student-$t$, as shown in (D.2) and (D.1). The distribution of the variance $1/\lambda_{cd}$ is inverse-gamma, which has been transformed here to show the pdf for $s$ and $\ln(s)$, plotted as a function of $s$ on the logarithmic horizontal axis. For a completely non-informative (improper) prior, the densities for the mean and the log-scale would be uniform, but the present prior was deliberately designed to be slightly informative, discouraging very large means and very small scales. The density functions have been re-scaled to the same maximum, i.e., they are not normalized.

B.3 Mixture weights
The prior mixture weights are denoted $v = (v_1, \ldots, v_C)$, where element $v_c$ is the probability that any random individual in the modeled population has a parameter vector drawn from the $c$th mixture component, i.e.,
$$v_c = P(\zeta_{nc} = 1 \mid v), \quad \forall n. \tag{B.8}$$
The prior distribution of the mixture weights $v$ is assigned as a Dirichlet distribution with concentration parameters $\gamma' = (\gamma'_1, \ldots, \gamma'_C)$,
$$p(v) = \frac{1}{B(\gamma')} \prod_{c=1}^{C} v_c^{\gamma'_c - 1}. \tag{B.9}$$
Here, the normalizing factor $B(\cdot)$ is the multivariate Beta function. We follow the conventional approach for training Bayesian mixture models and assign equal, fixed, small concentration parameters for all $c$, so that the learning favors sparse solutions. The implementation uses the Jeffreys prior $\gamma'_c = 0.5$ for all $c$. This prior value plays the role of pseudo-counts in the estimation of the mixture weights $v$. Even if none of the study participants is related to, say, the $c$th mixture component, the variational distribution of mixture weights for a random individual will in effect be calculated as if 0.5 subjects had actually shown a response pattern related to this component. The use of the Jeffreys prior as a pseudo-count is a theoretically well-defined way to keep the learning procedure open to the possibility that the next subject recruited from the same population might actually be characterized by response-model parameters drawn from the $c$th component.

C Model learning

C.1 Total log-likelihood
We now denote the array of all ordinal rating data across subjects as $y = (\ldots, y_n, \ldots)$, and all nominal situation data as $z = (\ldots, z_n, \ldots)$. The corresponding array of all individual parameters is written $\xi = (\ldots, \xi_n, \ldots)$. The observations are considered conditionally independent across subjects, given the parameters, so the total log-likelihood is simply a sum of the individual log-likelihood functions.
Using the conditional log-likelihoods for individual data in (A.3) and (A.6), together with the prior parameter densities in (B.1), (B.5), (B.6) and (B.9), the total log-likelihood of all observed data and all model parameters is (omitting irrelevant constants)
$$\begin{aligned}
\ln p(y, z, \xi, \mu, \lambda, \zeta, v) &= \sum_n \left[\ln p(y_n \mid \xi_n) + \ln p(z_n \mid \xi_n)\right] \\
&\quad + \sum_n \sum_c \zeta_{nc} \left[\ln p(\xi_n \mid \mu_c, \lambda_c) + \ln v_c\right] \\
&\quad + \sum_c \sum_d \left[\ln p(\mu_{cd} \mid \lambda_{cd}) + \ln p(\lambda_{cd})\right] \\
&\quad + \ln p(v).
\end{aligned} \tag{C.1}$$
Here, the first line represents the log-likelihood of the observed nominal and ordinal EMA data, given the individual model parameters; the second line is the log probability density of the individual model parameters as samples from the population GMM; and the remaining two lines specify the prior log probability density of the population-model parameters.

C.2 Variational inference
The model is trained from data with a variant of the standard Variational Inference (VI) procedure (e.g., Bishop, 2006, Ch. 10). A partially factorized density function $q(\cdot)$ is adapted to be a good approximation of the exact posterior density $p(\cdot)$ of all model parameters, given the observed data, as
$$q(\xi, \zeta, \mu, \lambda, v) \approx p(\xi, \zeta, \mu, \lambda, v \mid y, z). \tag{C.2}$$
The VI procedure maximizes a lower bound to the data log-likelihood,
$$\mathcal{L}(q) = \left\langle \ln \frac{p(y, z, \xi, \zeta, \mu, \lambda, v)}{q(\xi, \zeta, \mu, \lambda, v)} \right\rangle_{q} \le \ln p(y, z), \tag{C.3}$$
and thereby minimizes the Kullback-Leibler divergence
$$\mathrm{KL}(q \parallel p) = \left\langle \ln \frac{q(\xi, \zeta, \mu, \lambda, v)}{p(\xi, \zeta, \mu, \lambda, v \mid y, z)} \right\rangle_{q} \tag{C.4}$$
between the approximate and exact posterior parameter distributions. The procedure is iterative and theoretically guaranteed to converge. Since (C.1) includes only a sum of terms across $n$ for the individual parameters, and only a sum across $c$ for the mixture components, while the individual and population parameters are linked only by the latent indicators $\zeta$, the variational distributions are naturally factorized, without any further approximation, as
$$q(\xi, \zeta, \mu, \lambda, v) = q(v) \prod_n q(\xi_n, \zeta_n) \prod_c \prod_d q(\mu_{cd}, \lambda_{cd}). \tag{C.5}$$
Since the Gauss-gamma priors are conjugate distributions for the parameters of the Gaussian mixture components, the variational $q(\mu_{cd}, \lambda_{cd})$ will naturally take the same Gauss-gamma form as the priors, without any further approximation. Similarly, the variational $q(v)$ will naturally become a Dirichlet density. However, the individual parameter distributions $q(\xi_n)$ cannot be expressed as members of a known distribution family. Therefore, each density $q(\xi_n)$ is approximated by a large number of equally probable sample vectors $\hat{\xi}_{ns} = (\ldots, \hat{\xi}_{nsd}, \ldots)$, generated by Hamiltonian sampling (Neal, 2011).

C.2.1 Individual variables
The variational distribution $q(\xi, \zeta)$ is obtained by averaging (C.1) across the distributions of the other parameters, as
$$\ln q(\xi, \zeta) = \left\langle \ln p(y, z, \xi, \mu, \lambda, \zeta, v) \right\rangle_{q(\mu, \lambda, v)} + \text{const.}, \tag{C.6}$$
yielding, with the abbreviations
$$L(\xi_n) = \ln p(y_n \mid \xi_n) + \ln p(z_n \mid \xi_n), \qquad
l_c(\xi_n) = \left\langle \ln v_c + \ln p(\xi_n \mid \mu_c, \lambda_c) \right\rangle_{q(\mu, \lambda, v)}, \tag{C.7}$$
the joint distribution
$$\ln q(\xi_n, \zeta_n) = \text{const.} + L(\xi_n) + \sum_{c=1}^{C} \zeta_{nc}\, l_c(\xi_n). \tag{C.8}$$
The function $L(\xi_n)$ on the first line is the log-likelihood of the observed data, given the parameters. The functions $l_c(\xi_n)$ on the second line are the "log-responsibilities", i.e., non-normalized log probabilities that $\xi_n$ was generated by the $c$th mixture component. This yields a categorical distribution for $\zeta_n$, given $\xi_n$, with normalized probability mass $w_c(\xi_n) = P(\zeta_{nc} = 1 \mid \xi_n)$. A properly normalized version of this conditional distribution is
$$w_c(\xi_n) = \frac{e^{l_c(\xi_n)}}{\sum_{j=1}^{C} e^{l_j(\xi_n)}}. \tag{C.9}$$
The marginal density $q(\xi_n)$ is obtained from (C.8) and (C.9) as
$$\ln q(\xi_n) = \text{const.} + L(\xi_n) + \ln \sum_{c=1}^{C} e^{l_c(\xi_n)}. \tag{C.10}$$
Here, the last term is the log-likelihood that a candidate sample $\xi_n$ was drawn from any of the mixture components of the GMM. The Hamiltonian sampling procedure needs only the non-normalized part of $\ln q(\xi_n)$ defined in (C.10) and its gradient w.r.t. $\xi_n$, which is straightforward to compute. The sampling must be repeated for each iteration of the variational procedure, to account for the current estimate of the population model. The sampling is done separately for each participant model, so the computations for separate participants can run in parallel processes. Now, using a set $\{\hat{\xi}_{ns} : s = 1, \ldots, S\}$ of equally probable sample vectors drawn from $q(\xi_n)$, expectations of any function of the variables are consistently approximated simply by an average across those samples. In particular, the following marginal means will be needed in other parts of the learning procedure:
$$\langle \zeta_{nc} \rangle = \frac{1}{S} \sum_{s=1}^{S} w_c(\hat{\xi}_{ns}), \tag{C.11}$$
$$\langle \zeta_{nc}\, \xi_{nd} \rangle = \frac{1}{S} \sum_{s=1}^{S} w_c(\hat{\xi}_{ns})\, \hat{\xi}_{nsd}, \tag{C.12}$$
$$\langle \zeta_{nc}\, \xi_{nd}^2 \rangle = \frac{1}{S} \sum_{s=1}^{S} w_c(\hat{\xi}_{ns})\, \hat{\xi}_{nsd}^2. \tag{C.13}$$
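The sampling target (C.10) combines the data log-likelihood with a log-sum-exp over the log-responsibilities, which is numerically safe to compute with SciPy. In this sketch, data_loglik and log_resp are hypothetical callables standing in for $L(\xi_n)$ and $(l_1, \ldots, l_C)(\xi_n)$; they are not part of the EmaCalc API:

```python
import numpy as np
from scipy.special import logsumexp, softmax

def log_q_xi(xi_n, data_loglik, log_resp):
    """Non-normalized ln q(xi_n), as in (C.10)."""
    return data_loglik(xi_n) + logsumexp(log_resp(xi_n))

def zeta_weights(xi_n, log_resp):
    """Normalized conditional membership probabilities w_c(xi_n), as in (C.9)."""
    return softmax(log_resp(xi_n))

# Tiny smoke test with made-up stand-ins for L(xi_n) and (l_1, l_2)(xi_n):
ll = lambda x: -0.5 * np.sum(x**2)
lr = lambda x: np.array([-1.0, -2.0])
print(log_q_xi(np.zeros(3), ll, lr), zeta_weights(np.zeros(3), lr))

# The marginal means (C.11)-(C.13) then follow by averaging w_c(xi_ns),
# w_c(xi_ns) * xi_nsd, and w_c(xi_ns) * xi_nsd**2 across the S samples.
```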

C.2.2 Mixture weights
As (C.1) is a sum of terms involving $\ln v_c$, the variational mixture-weight distribution is defined by
$$\ln q(v) = \text{const.} + \sum_{c=1}^{C} \left(\gamma'_c - 1 + \sum_n \langle \zeta_{nc} \rangle\right) \ln v_c, \tag{C.14}$$
using the marginal expected mixture weights $\langle \zeta_{nc} \rangle$ from (C.11). Thus, the variational distribution again has the Dirichlet form, with concentration parameters $\gamma = (\gamma_1, \ldots, \gamma_C)$ updated as
$$\gamma_c = \gamma'_c + \sum_n \langle \zeta_{nc} \rangle. \tag{C.15}$$
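Numerically, the update (C.15) is a one-liner; this toy example uses made-up membership means for $N = 3$ participants and $C = 2$ components:

```python
import numpy as np

gamma_prior = 0.5                         # Jeffreys concentration gamma'_c, all c
zeta_mean = np.array([[0.9, 0.1],         # <zeta_nc> for N = 3 participants
                      [0.8, 0.2],
                      [0.1, 0.9]])
gamma = gamma_prior + zeta_mean.sum(axis=0)   # update (C.15) -> [2.3, 1.7]
v_mean = gamma / gamma.sum()                  # posterior mean mixture weights <v_c>
```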

C.2.3 Gauss-gamma mixture components
As (C.1) is a sum across terms for every element of every mixture component, the variational distributions are defined by
$$\ln q(\mu_{cd}, \lambda_{cd}) + \text{const.} = \left\langle \sum_n \zeta_{nc} \ln \mathcal{N}\!\left(\xi_{nd} \mid \mu_{cd}, \lambda_{cd}^{-1}\right) \right\rangle_{q(\xi, \zeta)} + \ln p(\mu_{cd} \mid \lambda_{cd}) + \ln p(\lambda_{cd}) \tag{C.16}$$
$$= \left(a' - \frac{1}{2} + \frac{1}{2} \sum_n \langle \zeta_{nc} \rangle\right) \ln \lambda_{cd} - \lambda_{cd} \left(b'_d + \frac{\nu'}{2} (\mu_{cd} - m'_{cd})^2 + \frac{1}{2} \sum_n \left\langle \zeta_{nc} (\xi_{nd} - \mu_{cd})^2 \right\rangle\right). \tag{C.17}$$
As this is a second-degree polynomial in $\mu_{cd}$, by completing the square we find the logarithm of a conditional Gaussian density for $\mu_{cd}$, given the precision $\lambda_{cd}$, as
$$\ln q(\mu_{cd} \mid \lambda_{cd}) = \text{const.} + \frac{1}{2} \ln \lambda_{cd} - \frac{1}{2} (\mu_{cd} - m_{cd})^2\, \nu_c \lambda_{cd}. \tag{C.18}$$
Thus, the variational density for the component mean is again a conditional Gaussian, with parameters updated as
$$m_{cd} = \frac{\nu' m'_{cd} + \sum_n \langle \zeta_{nc}\, \xi_{nd} \rangle}{\nu_c}, \tag{C.19}$$
$$\nu_c = \nu' + \sum_n \langle \zeta_{nc} \rangle. \tag{C.20}$$
The variational density for the precision parameter is then defined by
$$\ln q(\lambda_{cd}) = \ln q(\mu_{cd}, \lambda_{cd}) - \ln q(\mu_{cd} \mid \lambda_{cd}) \tag{C.21}$$
$$= \text{const.} + \left(a' - 1 + \frac{1}{2} \sum_n \langle \zeta_{nc} \rangle\right) \ln \lambda_{cd} - \lambda_{cd} \left(b'_d + \frac{1}{2}\left(\nu' (m'_{cd})^2 + \sum_n \langle \zeta_{nc}\, \xi_{nd}^2 \rangle - \nu_c\, m_{cd}^2\right)\right). \tag{C.22}$$
This is again the logarithm of a gamma density,
$$q(\lambda_{cd}) \propto \lambda_{cd}^{a_c - 1}\, e^{-b_{cd} \lambda_{cd}}, \tag{C.23}$$
with parameters updated as
$$a_c = a' + \frac{1}{2} \sum_n \langle \zeta_{nc} \rangle, \tag{C.24}$$
$$b_{cd} = b'_d + \frac{1}{2} \left(\nu' (m'_{cd})^2 + \sum_n \langle \zeta_{nc}\, \xi_{nd}^2 \rangle - \nu_c\, m_{cd}^2\right). \tag{C.25}$$
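The following is a compact sketch of the conjugate updates (C.19), (C.20), (C.24), and (C.25) for one mixture component $c$, taking the sample-based moments (C.11)-(C.13) as inputs. The variable names are hypothetical; this is not the EmaCalc code:

```python
import numpy as np

def gauss_gamma_update(zeta, z_xi, z_xi2, m0, nu0, a0, b0):
    """Variational Gauss-gamma updates for one mixture component c.

    zeta:  (N,)   <zeta_nc>            from (C.11)
    z_xi:  (N, D) <zeta_nc * xi_nd>    from (C.12)
    z_xi2: (N, D) <zeta_nc * xi_nd^2>  from (C.13)
    m0, nu0, a0, b0: prior parameters m'_cd, nu', a', b'_d
    """
    nu = nu0 + zeta.sum()                                  # (C.20)
    m = (nu0 * m0 + z_xi.sum(axis=0)) / nu                 # (C.19)
    a = a0 + 0.5 * zeta.sum()                              # (C.24)
    b = b0 + 0.5 * (nu0 * m0**2 + z_xi2.sum(axis=0) - nu * m**2)  # (C.25)
    return m, nu, a, b
```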

C.3 Log-likelihood lower bound
To monitor the progress of the variational learning, the lower bound (C.3) is most conveniently calculated at each iteration as
$$\mathcal{L}(q) = \sum_n \left\langle L(\xi_n) + \ln \sum_{c=1}^{C} e^{l_c(\xi_n)} \right\rangle_{q(\xi_n)} + \sum_n H[q(\xi_n)] - \mathrm{KL}\big(q(\mu \mid \lambda) \parallel p(\mu \mid \lambda)\big) - \mathrm{KL}\big(q(\lambda) \parallel p(\lambda)\big) - \mathrm{KL}\big(q(v) \parallel p(v)\big). \tag{C.26}$$
This lower bound is theoretically guaranteed to be non-decreasing at each step of the learning procedure, except for minor random variations caused by the sampling approximation. Here, the first two terms were already calculated by (C.10) during sampling. The third term is the entropy $H[q(\xi_n)]$ of each $q(\xi_n)$, which is calculated from the samples using a nearest-neighbour ("Kozachenko-Leonenko") estimator (Singh and Poczos, 2016). The last three terms subtract the Kullback-Leibler divergence $\mathrm{KL}(q \parallel p)$ between the posterior and prior distributions of the population parameters. The subtraction of Kullback-Leibler divergences represents the cost of model complexity, which is the basis of the Occam's Razor effect. The variational learning automatically tends to push the parameter distributions toward their priors for any mixture component that is not really needed to model the observed data, because this reduces the Kullback-Leibler divergence and increases $\mathcal{L}(q)$. When the learning procedure has finished, any unused mixture components may be deleted from the model with no loss of modeling accuracy.
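The entropy term can be estimated directly from the Hamiltonian samples. Below is a minimal version of the Kozachenko-Leonenko nearest-neighbour estimator; EmaCalc's own implementation may differ in details such as bias correction:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gammaln

def kl_entropy(samples):
    """Kozachenko-Leonenko entropy estimate (in nats) from samples, shape (S, D)."""
    S, D = samples.shape
    # Distance from each sample to its nearest neighbour (k=2: self + neighbour):
    dist, _ = cKDTree(samples).query(samples, k=2)
    r = dist[:, 1]
    log_unit_ball = (D / 2) * np.log(np.pi) - gammaln(D / 2 + 1)
    return digamma(S) - digamma(1) + log_unit_ball + D * np.mean(np.log(r))

# Sanity check on a 2-D standard Gaussian; true entropy = 1 + ln(2*pi) ~ 2.84:
rng = np.random.default_rng(1)
print(kl_entropy(rng.normal(size=(2000, 2))))
```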

D Predictive results
The learned population model is used to calculate two marginal distributions:

D.1 Population mean
The marginal distribution of the mean vector $\mu = (\mu_1, \ldots, \mu_D)$ (equal to the median) is a mixture density, based on (C.19) and (C.15), integrated over the variational distributions (C.23) of the precision parameters:
$$p(\mu_d \mid y, z) = \sum_{c=1}^{C} \langle v_c \rangle\, \mathrm{St}\!\left(\mu_d \,\middle|\, m_{cd},\ \frac{b_{cd}}{a_c \nu_c},\ 2 a_c\right), \qquad \langle v_c \rangle = \frac{\gamma_c}{\sum_j \gamma_j}. \tag{D.1}$$
Thus, the marginal density for the mean is a mixture including a univariate Student-$t$ distribution for each element $\mu_{cd}$ of each mixture component, with location $m_{cd}$, scale $b_{cd}/(a_c \nu_c)$, and $2a_c$ degrees of freedom.

D.2 Random individual
The predictive distribution $p(\xi_N)$ of the parameters $\xi_N = (\ldots, \xi_{Nd}, \ldots)$ for a future, not-yet-seen individual, randomly drawn from the modeled population, is a mixture density based on (B.2) and (C.15), integrated over the learned variational distributions (C.19) and (C.23) for the Gaussian component mean and precision parameters:
$$p(\xi_{Nd}) = \sum_{c=1}^{C} \langle v_c \rangle\, \mathrm{St}\!\left(\xi_{Nd} \,\middle|\, m_{cd},\ \frac{b_{cd}(\nu_c + 1)}{a_c \nu_c},\ 2 a_c\right). \tag{D.2}$$
The resulting mixture thus includes a univariate Student-$t$ distribution for each element $\xi_{Nd}$, given each mixture component, with location $m_{cd}$, scale $b_{cd}(\nu_c + 1)/(a_c \nu_c)$, and $2a_c$ degrees of freedom.
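For illustration, one can draw Monte Carlo samples of a single element $\xi_{Nd}$ from the Student-$t$ mixture (D.2). The parameter values below are made up, and the quoted scale is interpreted as the squared scale of the Student-$t$:

```python
import numpy as np

def sample_random_individual(gamma, m, nu, a, b, size, rng):
    """Draw samples of one element xi_Nd for an unseen random individual, per (D.2).

    gamma: (C,) Dirichlet concentrations; m, nu, a, b: learned Gauss-gamma
    parameters of each component, for the dth element.
    """
    w = gamma / gamma.sum()                      # expected mixture weights <v_c>
    c = rng.choice(len(w), p=w, size=size)       # pick one component per draw
    scale = np.sqrt(b[c] * (nu[c] + 1) / (a[c] * nu[c]))   # assumed squared scale
    return m[c] + scale * rng.standard_t(2 * a[c])

rng = np.random.default_rng(2)                   # made-up C = 2 component parameters:
draws = sample_random_individual(np.array([2.3, 1.7]), np.array([0.0, 1.0]),
                                 np.array([3.0, 2.0]), np.array([1.75, 1.25]),
                                 np.array([1.0, 0.8]), 10_000, rng)
```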